a) data selection, coreset selection, dataset pruning: select a subset of training data
- survey on coreset selection: https://arxiv.org/pdf/2505.17799
some papers on dataset pruning for generative models:
* Li, Yize, et al. "Pruning then reweighting: Towards data-efficient training of diffusion models." ICASSP 2025. IEEE, 2025.
* Moser, Brian B., Federico Raue, and Andreas Dengel. "A study in dataset pruning for image super-resolution." International Conference on Artificial Neural Networks. Cham: Springer Nature Switzerland, 2024.

dataset quantization: divide the training set into bins and select representative samples from each bin. It can itself serve as a data selection strategy (a minimal sketch follows).
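To make the bin-and-select idea concrete, below is a minimal sketch of dataset-quantization-style selection. It assumes samples are already embedded as feature vectors (e.g., CLIP features); the distance-based binning rule, bin count, and per-bin budget are illustrative assumptions, not details from the cited papers.

```python
import numpy as np

def quantize_select(embeddings: np.ndarray, n_bins: int = 10,
                    per_bin: int = 100) -> np.ndarray:
    """Bin samples by distance to the dataset mean (equal-frequency bins),
    then keep the samples nearest each bin's own centroid."""
    center = embeddings.mean(axis=0)
    dists = np.linalg.norm(embeddings - center, axis=1)
    edges = np.quantile(dists, np.linspace(0.0, 1.0, n_bins + 1))
    bins = np.digitize(dists, edges[1:-1])   # bin index in [0, n_bins)
    selected = []
    for b in range(n_bins):
        idx = np.where(bins == b)[0]
        if idx.size == 0:
            continue
        # Representative samples: nearest to this bin's own centroid.
        bin_center = embeddings[idx].mean(axis=0)
        order = np.argsort(np.linalg.norm(embeddings[idx] - bin_center, axis=1))
        selected.extend(idx[order[:per_bin]].tolist())
    return np.asarray(selected)

# Hypothetical usage: indices = quantize_select(clip_features, n_bins=20, per_bin=50)
```

The equal-frequency binning keeps both typical and atypical samples in the subset, which is the point of quantization over plain top-k scoring.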
data attribution: attribution scores can be used as a measure for data selection (see the sketch after this list).
- survey on data attribution: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=5451054
some papers on data attribution for generative models:
* Georgiev, Kristian, et al. "The journey, not the destination: How data guides diffusion models." arXiv preprint arXiv:2312.06205 (2023).
* Zheng, Xiaosen, et al. "Intriguing properties of data attribution on diffusion models." arXiv preprint arXiv:2311.00500 (2023).
* Lin, Jinxu, et al. "Diffusion Attribution Score: Evaluating Training Data Influence in Diffusion Models." arXiv preprint arXiv:2410.18639 (2024).
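As one concrete way attribution could drive selection, here is a TracIn-style sketch that scores each training sample by the dot product between its loss gradient and a validation-loss gradient. It uses a generic supervised loss for readability (for a diffusion model the denoising objective would take its place); `model`, `loss_fn`, and the batches are hypothetical inputs, and this is not the exact method of any paper cited above.

```python
import torch

def attribution_scores(model, loss_fn, train_batch, val_batch):
    """Score each training sample by the alignment of its gradient
    with the validation-loss gradient (TracIn-style, single checkpoint)."""
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(loss):
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.flatten() for g in grads])

    xv, yv = val_batch
    val_grad = flat_grad(loss_fn(model(xv), yv))

    scores = []
    for x, y in zip(*train_batch):        # one sample at a time (slow but clear)
        g = flat_grad(loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)))
        scores.append(torch.dot(g, val_grad).item())
    return scores   # higher = sample's gradient pushes validation loss down

# Hypothetical usage: keep the top-k samples by score as the selected subset.
```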
b) dataset distillation: optimize the training set itself; the optimized training images are not realistic images. There is no existing work that uses distilled images to train a generative model (a gradient-matching sketch of the standard classification-setting recipe is below).
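For reference, the standard recipe in the classification setting is gradient matching (Zhao et al., "Dataset Condensation with Gradient Matching"). The sketch below is a toy version under strong assumptions: a fixed linear model, fixed labels, and a single matching loop; it illustrates why the learned images drift away from realistic ones, and is not a recipe for training generative models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distill(real_x, real_y, n_syn=10, steps=200, lr=0.1):
    """Learn synthetic images whose gradients on a small model match
    the gradients produced by the real data."""
    syn_x = torch.randn(n_syn, *real_x.shape[1:], requires_grad=True)
    syn_y = real_y[:n_syn].clone()            # reuse real labels for simplicity
    opt = torch.optim.SGD([syn_x], lr=lr)
    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(real_x[0].numel(), int(real_y.max()) + 1))
    params = list(model.parameters())

    for _ in range(steps):
        g_real = torch.autograd.grad(
            F.cross_entropy(model(real_x), real_y), params)
        g_syn = torch.autograd.grad(
            F.cross_entropy(model(syn_x), syn_y), params, create_graph=True)
        # Update the synthetic images so their gradients mimic the real ones.
        loss = sum(F.mse_loss(gs, gr) for gs, gr in zip(g_syn, g_real))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return syn_x.detach(), syn_y

# Hypothetical usage: syn_x, syn_y = distill(images, labels)
```

Because syn_x is optimized purely for gradient alignment, nothing constrains it to look like a natural image, which matches the observation above.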